Export documents and assets (WHIT-2419) #3311
Conversation
Content Publisher has two means of designating an edition as 'political':

1. 'system political': the edition is associated with role appointments or political organisations
2. 'editor political': the publisher has explicitly marked the document as political via the "Gets history mode" checkbox in Content Publisher's UI

For the purposes of importing these documents into Whitehall, we don't particularly care about the _means_ by which a document is considered political, and we don't need the overhead of mapping one 'flavour' of political to Whitehall's representation of that 'flavour'. It is enough to pass the derived boolean from Content Publisher into Whitehall, which has its own 'political' boolean override equivalent: https://github.com/alphagov/whitehall/blob/f5afce65514b02767df8625ccfa40c681046c95e/db/schema.rb#L444
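The derivation described above might be sketched as follows. This is a hedged illustration only: the field names (`editor_political`, `role_appointments`, `political_organisations`) and the precedence rule (an explicit editor choice wins over the system-derived value) are assumptions, not Content Publisher's actual API.

```ruby
# Hypothetical sketch of collapsing the two 'flavours' of political into the
# single boolean the export passes to Whitehall. The edition is modelled as a
# plain hash here for illustration.
module PoliticalExport
  def self.political?(edition)
    editor = edition[:editor_political] # the "Gets history mode" checkbox
    return editor unless editor.nil?    # an explicit editor choice wins

    # "system political": derived from associated taggings
    edition.fetch(:role_appointments, []).any? ||
      edition.fetch(:political_organisations, []).any?
  end
end
```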
Whitehall stores change notes and editorial remarks _per edition_, yet for simplicity we're only planning to migrate the latest (published) edition out of Content Publisher. So we imagine this property will be used to form one large 'editorial remark' in Whitehall (and potentially one large entry in the public change note history too).

The TimelineEntry class is doing the heavy lifting here. I've generally followed the same logic as _content_publisher_entry.html.erb in deciding which properties to expose.

The way we're testing this behaviour (by layering edition upon edition in a unit test) doesn't appear to have been done before, as we were getting an uninitialized constant error on `RevisionUpdater::Image`. Adding a `require_relative` to the `RevisionUpdater` class worked around that.
Looks okay to me; a few suggestions around redundant or needed data in comments, and plenty of musings that might be useful for the import work.
FYI - There are quite a few TODOs in the commit messages, which are actually still to do.
Having done some import work with Tony before, I would suggest going through the exercise of creating both document types in WH from the console (including image and file attachment) to better understand what you actually need and address those TODOs. That is, before merging this in. Otherwise I bet this will be followed up by a few other PRs 😅 #burned_before
Whilst it's nice and clean to keep the export WH-agnostic, I'd probably go one step further and do the required mappings here already (like CP state to WH state, and other specific fields we'd want there). What we learned from the other import work was that if you keep the export mappings un-opinionated, you just end up having to test both ends, and it wasn't worth it, as this is actually an export for WH anyway.
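For illustration, the kind of opinionated mapping being suggested might look like the sketch below. All state names here are assumptions for the example, not verified values from either application.

```ruby
# Hypothetical Content Publisher -> Whitehall state mapping (illustrative values).
CP_TO_WH_STATE = {
  "published" => "published",
  "withdrawn" => "withdrawn",
}.freeze

def whitehall_state(cp_state)
  # Fail loudly on anything unmapped rather than exporting an ambiguous state.
  CP_TO_WH_STATE.fetch(cp_state) { raise ArgumentError, "unmapped state: #{cp_state}" }
end
```

Doing this in the exporter means the import side can trust the payload instead of re-testing the mapping.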
Whilst digging through assets I realised that plenty of those which are not strictly on the live edition are still live. Perhaps moving everything in WH takes us one step closer to being able to do a data cleanup in AM one day.
This commit replaces `all_asset_variants_uploaded?` with a simpler `asset_uploaded?` method for both images and attachments. Instead of requiring all variant files (e.g. `s960`, `s300`, etc.) to be present before allowing publishing or previewing, we now consider the asset "ready" if the original file has successfully uploaded.

In practice, all variants are generated and uploaded within milliseconds of the original upload, and failures nearly always stem from the original file itself. For example, if Asset Manager has an outage or detects a virus in the uploaded asset, then the original variant will fail to upload, just as the variants will. Conversely, if the original variant has successfully uploaded, then we can be highly confident (though can't _guarantee_) that its variants will have uploaded too. That minimal hypothetical edge case is a risk we can live with, in exchange for the following benefits:

* Supports content with partial variant sets (we will be importing news articles from Content Publisher, which have only the `s960` and `s300` variants).
* Enables future support for dynamically generated variants (e.g. via Fastly image resizing) without requiring full variant pre-generation.
* Greatly simplifies the codebase and removes brittle assumptions.

See alphagov/content-publisher#3311 (comment) for a discussion on the different variant sizes and where they're used.

As you can see in this commit, there is a lot of room for improvement and consolidation across Whitehall. We have several implementations of `asset_uploaded?` that are identical across modules. In practice, most of these implementations were already only checking for the 'original' variant, and have now been simplified to make that absolutely explicit. We should revisit this commit later to consider how to extract some of this behaviour into a shared module so that we're not duplicating identical logic in multiple places.
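A minimal sketch of the before/after behaviour the commit describes. The data structure is illustrative; the real methods live on Whitehall's image and attachment models, not on a standalone class like this one.

```ruby
# Illustrative model: variant name => upload state, e.g. { "original" => :uploaded }
class ImageAssets
  VARIANTS = %w[original s960 s300].freeze

  def initialize(assets)
    @assets = assets
  end

  # Old behaviour: every known variant had to be uploaded before publish/preview,
  # which rejects content with partial variant sets.
  def all_asset_variants_uploaded?
    VARIANTS.all? { |variant| @assets[variant] == :uploaded }
  end

  # New behaviour: the original alone gates publishing, since variants almost
  # always succeed or fail together with the original upload.
  def asset_uploaded?
    @assets["original"] == :uploaded
  end
end
```

Under the old check, a Content Publisher import carrying only `s960`/`s300` would never be publishable; under the new one it is.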
We expect that we'll drop the `high_resolution` variant at the point of importing into Whitehall, but have kept it here for completeness. The existing 960-wide and 300-wide variants match some of the allowed sizes in Whitehall, so should be importable. We expect the exported image can be mapped to Whitehall's modelling like so:

```
$ Image.last
=> #<Image:0x0000ffff63777d98
 id: 796660,
 image_data_id: 258369,
 edition_id: 1691184,
 alt_text: <ALT TEXT GOES HERE>,
 caption: <CAPTION GOES HERE>,
 created_at: "2025-06-22 17:38:04.000000000 +0100",
 updated_at: "2025-06-22 17:38:07.000000000 +0100">

$ Image.last.image_data
=> #<ImageData:0x0000ffff634668c0
 id: 258369,
 carrierwave_image: "ELECTRICITY_BILLS_TO_BE_SLASHED_FOR_BUSINESSES_1__...", # TODO: is there a problem here?
 created_at: "2025-06-22 17:38:04.000000000 +0100",
 updated_at: "2025-06-22 17:38:04.000000000 +0100",
 image_kind: "default">

$ Image.last.image_data.assets
=> [#<Asset:0x0000ffff6302e7a8
  id: 3930905,
  asset_manager_id: <DERIVE THE ASSET MANAGER ID FROM THE URL AND PUT THAT HERE>,
  variant: <VARIANT GOES HERE with 's' prefix, e.g. 's300'>,
  created_at: "2025-06-22 17:38:05.267635000 +0100",
  updated_at: "2025-06-22 17:38:05.267635000 +0100",
  assetable_type: "ImageData",
  assetable_id: 258369,
  filename: <DERIVE THE FILENAME FROM THE URL AND PUT THAT HERE>,
 <Asset:0x0000ffff63d33340 ... # and so on

# TODO: will there be an issue with the missing s216/s465/s630 variants?
```
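The `<DERIVE ... FROM THE URL>` placeholders suggest the import script will parse Asset Manager media URLs. A sketch, assuming those URLs take the shape `.../media/<asset_manager_id>/<filename>` (the helper name is hypothetical):

```ruby
require "uri"

# Pull the asset_manager_id and filename out of an Asset Manager media URL.
def asset_fields_from_url(url)
  segments = URI.parse(url).path.split("/")
  media_index = segments.index("media")
  raise ArgumentError, "not a media URL: #{url}" if media_index.nil?

  { asset_manager_id: segments[media_index + 1],
    filename: segments[media_index + 2] }
end
```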
The original implementation exposed the following properties as well:

- isbn
- unique_reference
- paper_number
- parliamentary_session
- official_document_type

...but Content Publisher only has 54 file attachments in total, and none of them define values for any of these properties: they are all `nil`. So there is little point in exposing them as part of the export.

We imagine the exported attachment can be fairly easily mapped into Whitehall's attachment modelling:

```
whitehall(prod)> Attachment.find_by(attachment_data: 1343266)
=> <FileAttachment:0x0000ffff766fd5d0
 id: 8823005,
 created_at: "2025-08-29 16:47:37.000000000 +0100",
 updated_at: "2025-08-29 16:47:37.000000000 +0100",
 title: <TITLE GOES HERE>,
 accessible: false, # As per https://gds.slack.com/archives/C03D4A72HGE/p1755694877638679 - all non-HTML attachments should have this unchecked as per guidance.
 isbn: nil,
 unique_reference: nil,
 command_paper_number: nil,
 hoc_paper_number: nil,
 unnumbered_command_paper: nil,
 unnumbered_hoc_paper: nil,
 parliamentary_session: nil,
 type: "FileAttachment",
 ...>
```

...and:

```
AttachmentData.find(1343266)
=> <AttachmentData:0x0000ffff7671a518
 id: 1343266,
 carrierwave_file: "sample.pdf", # TODO??
 content_type: "application/pdf", # TODO??
 file_size: 131514, # TODO??
 number_of_pages: 6, # TODO??
 created_at: "2025-08-29 16:47:37.000000000 +0100",
 updated_at: "2025-08-29 16:47:37.000000000 +0100",
 replaced_by_id: nil>
```

Some things remain to be worked out about the AttachmentData, but I imagine the import script will need to `curl` the attachment and figure out its file size, number of pages and content type.
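The final paragraph suggests probing each attachment over HTTP. A sketch of that idea, with hypothetical helper names; `number_of_pages` would additionally require inspecting the PDF body, which isn't shown here.

```ruby
require "net/http"
require "uri"

# Map HTTP response headers onto the AttachmentData fields we need.
def attachment_metadata(headers)
  { content_type: headers["content-type"],
    file_size: headers["content-length"]&.to_i }
end

# Fetch only the headers for an attachment URL (a HEAD request avoids
# downloading the file body just to learn its size and type).
def probe_attachment(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri)
  end
  attachment_metadata("content-type" => response["content-type"],
                      "content-length" => response["content-length"])
end
```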
I've created two rake tasks:

1. `export:live_document_and_assets`, which exports a single document, given its content ID. By default it outputs to STDOUT but will write to a JSON file if a path is provided.
2. `export:live_documents_and_assets`, which exports all of the live documents and assets. By default it outputs to STDOUT but will write a separate JSON file for each exported document if an output directory is provided.

Usage (respectively):

```
rake export:live_document_and_assets["4c2ff601-4494-458b-bdc3-7d358122c2ae","./chris/foo.json"]
```

```
rake export:live_documents_and_assets["./chris"]
```
We're likely to use the former in Whitehall. We won't use the latter (there is no DB field for it).
We need to be able to associate a news article with a government ID to properly enable history mode.
The 'document_history' property wasn't exposing any public
change notes. We'll revisit `document_history` in the next commit.
I did consider rolling out my own change_notes implementation:
```
def self.change_notes(document)
  change_notes = document.editions.map do |edition|
    change_note = edition.revision.metadata_revision.change_note
    if change_note.present?
      { change_note:, created_at: edition.created_at, created_by: User.find(edition.created_by_id).email }
    end
  end
  change_notes.compact
end
```
...but on closer inspection, there is a PublishingApiPayload
`History` class that does exactly what we want and is already
fully tested:
https://github.com/alphagov/content-publisher/blob/ef1ee481f5309e6fd010967231ff5b4723db60ca/spec/lib/publishing_api_payload/history_spec.rb#L1
It is enough for our tests to check that that class is called.
That class also handles the backdating of `first_published_at`,
so we're fixing that whilst we're here.
In the initial implementation of this, we'd wrongly assumed that public change notes were included in the document history displayed to publishers in the UI. We've since defined a separate `change_notes` method for capturing that. That new method also includes the 'backdating' information, and we've updated `first_published_at` to accommodate backdating too, so it no longer needs to be surfaced here. What is still needed is some means of exposing 'internal' change history, i.e. internal notes / editorial remarks, along with the option of carrying over the update/publish dates of each revision and the author of each. So in this commit we've renamed the method to `internal_history` to better describe the new scope, and have dropped any complexity around backdating.
Thanks @lauraghiorghisor-tw - I'm happy enough that the export gives us everything we need for alphagov/whitehall#10635 now. Merging 🎉
What
This PR defines two rake tasks that can be used to export content out of Content Publisher:
1. `export:live_document_and_assets`, which exports a single document, given its content ID. By default it outputs to STDOUT but will write to a JSON file if a path is provided.
2. `export:live_documents_and_assets`, which exports all of the live documents and assets. By default it outputs to STDOUT but will write a separate JSON file for each exported document if an output directory is provided.

Usage (respectively):

```
rake export:live_document_and_assets["4c2ff601-4494-458b-bdc3-7d358122c2ae","./chris/foo.json"]
```

```
rake export:live_documents_and_assets["./chris"]
```
The latter rake task is what we imagine we'll use alongside an import script in Whitehall, to be built in https://gov-uk.atlassian.net/browse/WHIT-2440. Sister PR: alphagov/whitehall#10635
Why
We want to retire Content Publisher. Whilst we could 'withdraw' all content to avoid having to migrate it, we'd then need to make sure we're able to find some way of hacking an update directly into Publishing API in an emergency, and we figured it would ultimately just complicate the publishing landscape even further. Better to export the content out of Content Publisher and back into Whitehall, where it can be managed centrally.
On investigation, there are only around 3,000 documents to export, of two different content types, and they're all low-value pieces (news articles that have served their purpose already and, as stated above, have already been cleared for mass withdrawal by our Content colleagues, provided we can be sure we can apply edits retrospectively). For that reason, a thorough reproduction of every bit of document history is not necessary here: we want to do a lift and shift of content as best we can without expending weeks' worth of effort. We're therefore only exporting the latest published edition for each document (no drafts) and are not bothering to transfer over the removed or unpublished-with-redirect items, since these would mean extra complexity for virtually no benefit.
JIRA: https://gov-uk.atlassian.net/browse/WHIT-2419